Skip to content

Conversation

@benclifford
Copy link
Collaborator

This should not change behaviour and only move code around and add in more wiring.

This is working towards all of checkpointing/memoizer code being plugable, with the DFK having no knowledge of the specifics - for example, the DFK no longer imports pickle, because that is now the business of the memoizer.

Because of that goal, the parameters for Memoizer are arranged in two ways:

Constructor parameters for Memoizer are parameters that the future user will supply at configuration when Memoizer is exposed as a user plugin, like other plugins.

DFK internals that should be injected into the (eventually pluggable) memoizer (small m) are set as startup-time attributes, as for other plugins.

Right now these two things happen right next to each other, but the Memoizer constructor call will move into user configuration space in a subsequent PR.

As before this PR, there is still a separation between "checkpointing" and "memoization" that is slightly artificial. Subsequent PRs will merge these actions together more in this Memoizer implementation.

This PR preserves a user-facing dfk.checkpoint() call, that becomes a passthrough to Memoizer.checkpoint(). I think that is likely to disappear in a future PR: manual checkpoint triggering will become a feature (or non-feature) of a specific memoizer implementation and so a method directly on that memoizer (or not).

To go alongside the existing update_memo call, called by the DFK when a task is ready for memoization, this PR adds update_checkpoint, which captures the slightly different notion of a task being ready for checkpointing -- the current implementation can memoize a task that is not yet complete because it memoizes the future, not the result, while a checkpoint needs the actual result to be ready and so must be called later. This latter call happens in the race-condition-prone handle_app_update. A later PR will remove this distinction and move everything to around the same place as update_memo. See PR #3979 for description of fixing a related race condition related to update_memo.

See PR #3535 for broader context.

Changed Behaviour

none

Type of change

  • New feature

This should not change behaviour and only move code around and
add in more wiring.

This is working towards all of checkpointing/memoizer code being
plugable, with the DFK having no knowledge of the specifics - for
example, the DFK no longer imports `pickle`, because that is now the
business of the memoizer.

Because of that goal, the parameters for Memoizer are arranged
in two ways:

Constructor parameters for Memoizer are parameters that the future user
will supply at configuration when Memoizer is exposed as a user plugin,
like other plugins.

DFK internals that should be injected into the (eventually pluggable)
memoizer (small m) are set as startup-time attributes, as for other
plugins.

Right now these two things happen right next to each other, but the
Memoizer constructor call will move into user configuration space in
a subsequent PR.

As before this PR, there is still a separation between "checkpointing"
and "memoization" that is slightly artificial. Subsequent PRs will
merge these actions together more in this Memoizer implementation.

This PR preserves a user-facing dfk.checkpoint() call, that becomes a
passthrough to Memoizer.checkpoint().  I think that is likely to
disappear in a future PR: manual checkpoint triggering will become a
feature (or non-feature) of a specific memoizer implementation and so
a method directly on that memoizer (or not).

To go alongside the existing update_memo call, called by the DFK when
a task is ready for memoization, this PR adds update_checkpoint, which
captures the slightly different notion of a task being ready for
checkpointing -- the current implementation can memoize a task that is
not yet complete because it memoizes the future, not the result, while
a checkpoint needs the actual result to be ready and so must be called
later. This latter call happens in the race-condition-prone
handle_app_update. A later PR will remove this distinction and move
everything to around the same place as update_memo. See PR #3979 for
description of fixing a related race condition related to update_memo.

See PR #3535 for broader context.
@khk-globus khk-globus force-pushed the benc-checkpoint-rearrange branch from 42e7c51 to e81d668 Compare October 13, 2025 14:23
Copy link
Collaborator

@khk-globus khk-globus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like, as you say, this is basically just moving code. One question about the run_dir, but the others look to effectively be comments on years-old code.

Comment on lines +152 to +153
run_dir: str

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make run_dir an optional instance member? For example, if checkpoint_files and checkpoint_mode are the right state and the consumer forgets to externally set it before calling .start() ... seems like a potential traceback?

As opposed to, say, choosing a default value, or ensuring that the mistake is caught at instantiation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's two initializations in the parsl style of doing configuration/plugins: the user creates the object, and so user-specified parameters go into __init__. Secondly, there's the framework that is the DFK that invents runtime stuff like the rundir. By the time this happens, __init__ is long in the past - perhaps in a different source file and well before the DFK even begins to exist.

So it's probably right to make this be mandatory parameter of start().

The plugin style in general doesn't do well in differentiating between "uninitialised, configured" (pre-start) plugins and "intialised" (post-start) objects. That differentiation was something that was forced with monitoring plugin radios, because the "configured, uninitialized" state needs to be moved around over the network. And if I was going to re-implement this object model, I'd probably try to make that separation happen everywhere. But that's not how Parsl is right now.

Comment on lines +197 to +199
if self.checkpoint_period is None:
raise ConfigurationError("Checkpoint period must be specified with periodic checkpoint mode")
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the checkpoint_period attribute can be changed later? In some sense "too bad," as this check is nominally not needed after instantiation or comes much later than would be ideal (i.e., "immediately").

Meanwhile, stylistically, given the [no-return] semantic of the raise, could remove the else: and unindent the block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably weird to change it at all, and the validation can go into the __init__ of Memoizer I think. But this PR was trying to move blocks as big as possible and the subsequent timer start can't happen until start time.

Comment on lines +202 to +203
except Exception:
raise ConfigurationError("invalid checkpoint_period provided: {0} expected HH:MM:SS".format(self.checkpoint_period))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any sense in raise ... from ...? Given the not None check and typing, .split() is guaranteed to work, and the only failures would be an invalid int or not enough items for the destructure. (If interested, could also imit the number of splits to 2.)

It would be nice to ensure that the this field is valid much earlier. Perhaps in a setter-style setup?

@benclifford benclifford added this pull request to the merge queue Oct 15, 2025
Merged via the queue into master with commit 1aff212 Oct 15, 2025
11 checks passed
@benclifford benclifford deleted the benc-checkpoint-rearrange branch October 15, 2025 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants